Better Data


More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning

Zhao, Yike, Guo, Simin, Yang, Ziqing, Han, Shifan, Lin, Dahua, Tan, Fei

arXiv.org Artificial Intelligence

The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance "more data" versus "better data" for real-world reasoning tasks.


There's no Data Like Better Data: Using QE Metrics for MT Data Filtering

Peter, Jan-Thorsten, Vilar, David, Deutsch, Daniel, Finkelstein, Mara, Juraska, Juraj, Freitag, Markus

arXiv.org Artificial Intelligence

Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has seen major improvements in recent years with the use of neural metrics. In this paper we analyze the viability of using QE metrics to filter out bad-quality sentence pairs from the training data of neural machine translation (NMT) systems. While most corpus filtering methods focus on detecting noisy examples in collections of texts, usually huge amounts of web-crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest-quality sentence pairs in the training data, we can improve translation quality while reducing the training set size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between the two approaches.
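The filtering idea described above can be sketched in a few lines: score every sentence pair with a QE metric and keep only the top-scoring half. This is a minimal illustration, not the paper's implementation; `qe_score` is a hypothetical placeholder for a real neural QE model, and the toy length-ratio scorer below exists only to make the sketch runnable.

```python
def filter_by_qe(pairs, qe_score, keep_fraction=0.5):
    """Keep the highest-scoring fraction of sentence pairs.

    pairs: list of (source, target) tuples
    qe_score: callable (source, target) -> float, higher = better quality
    keep_fraction: share of the corpus to retain (0.5 = half, as in the paper)
    """
    # Sort pairs from best to worst according to the QE metric.
    scored = sorted(pairs, key=lambda p: qe_score(*p), reverse=True)
    cutoff = max(1, int(len(scored) * keep_fraction))
    return scored[:cutoff]


# Toy usage with a crude proxy scorer (length ratio); a real pipeline
# would plug in a trained QE model here instead.
corpus = [("hello world", "hallo welt"), ("good morning", "xx")]
crude = lambda s, t: min(len(s), len(t)) / max(len(s), len(t))
kept = filter_by_qe(corpus, crude)
```

With the toy scorer, the badly mismatched pair is dropped and only the plausible translation survives, halving the corpus exactly as the paper's setup does.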


AI expert Meredith Broussard: 'Racism, sexism and ableism are systemic problems'

The Guardian

Meredith Broussard is a data journalist and academic whose research focuses on bias in artificial intelligence (AI). She has been in the vanguard of raising awareness and sounding the alarm about unchecked AI. Her previous book, Artificial Unintelligence (2018), coined the term "technochauvinism" to describe the blind belief in the superiority of tech solutions to solve our problems. She appeared in the Netflix documentary Coded Bias (2020), which explores how algorithms encode and propagate discrimination. Her new book is More Than a Glitch: Confronting Race, Gender and Ability Bias in Tech.


Hippo Insurance CTO insurtech predictions for 2023

#artificialintelligence

As we welcome the new year, it's natural to reflect on the year that passed and look ahead to the challenges and opportunities to come, and more specifically to how new technologies might impact the insurance industry. As always, we must separate the signal from the noise. For many, artificial intelligence is a perennial buzzword, but paradoxically, the technology appears largely still in its infancy in the insurance industry, especially in the home insurance space. Regulators and insurers alike are understandably grappling with the lack of model explainability, which presents challenges for the widespread use of AI to directly evaluate and price risk for homeowners insurance in the near future. Instead, major technological innovation in homeowners insurance in the coming year will likely come from solutions and tools designed to improve the ingestion and processing of data in ways that positively impact the consumer experience throughout their homeownership journey.


Big Tech builds AI with bad data. So scientists sought better data.

#artificialintelligence

Yacine Jernite's fears about bias in artificial intelligence were vividly affirmed in 2017, when a Facebook translation error led Israeli police to arrest a Palestinian construction worker. The man had posted a picture of himself leaning against a bulldozer with the caption, in Arabic, "good morning." Facebook mistakenly translated it, in Hebrew, as "attack them." The error was quickly discovered and the man released, according to a report in Haaretz, but the incident cemented personal concerns about AI for Jernite, who joined Facebook's AI division soon after. As the child of Moroccan parents in post-9/11 America, Jernite said he has "spent hours upon hours in immigration secondary interviews -- in a way that I could not at the time trace to the technology that was being applied."


Data Centric Artificial Intelligence

#artificialintelligence

Data-centric artificial intelligence is a modern approach to building AI systems around quality data. Data-centric AI prioritizes the quality of data over the quantity of data, while traditional model-centric AI does the opposite. The key is better data, not big data! The core idea of data-centric AI is to handle data the way one handles high-quality materials when building a house, i.e., spending relatively more time labelling, augmenting, managing, and curating the data. The traditional way is to optimize highly parameterized models on big data to achieve high performance.
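A minimal sketch of what "spending more time curating the data" can mean in practice, under the simplifying assumption that quality problems show up as duplicates and invalid labels. The `curate` function and its rules are illustrative, not a prescribed data-centric AI pipeline.

```python
def curate(examples, valid_labels):
    """Clean a labelled dataset before training.

    examples: list of (text, label) pairs
    valid_labels: set of labels considered legal for the task
    Returns a deduplicated list with only validly labelled examples.
    """
    seen = set()
    cleaned = []
    for text, label in examples:
        key = text.strip().lower()
        if key in seen:                # drop exact duplicate texts
            continue
        if label not in valid_labels:  # drop examples with invalid labels
            continue
        seen.add(key)
        cleaned.append((text.strip(), label))
    return cleaned


# Toy usage: one duplicate and one mislabelled example are removed.
data = [("Cat photo", "cat"), ("cat photo", "cat"), ("Dog photo", "fish")]
cleaned = curate(data, {"cat", "dog"})
```

Real data-centric workflows go further (label audits, annotator agreement checks, targeted augmentation), but the principle is the same: improve the dataset before reaching for a bigger model.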


Big Tech builds AI with bad data. So scientists sought better data.

#artificialintelligence

Now Jernite, 33, is trying to push AI in a better direction. After leaving Facebook, he joined BigScience, a global effort by 1,000 researchers in 60 countries to build a more transparent, accountable AI, with less of the bias that infects so many Big Tech initiatives. The largely volunteer effort trained a computer system with good data that was curated by humans from different cultures, rather than readily available data scraped from the internet, written mostly in English, and riddled with harmful speech on race, gender and religion. The resulting AI was released on July 12 for researchers to download and study.



Every Business Can Work More Efficiently With Better Data

#artificialintelligence

This 10-course bundle can get you up to speed on today's top data technologies and will help you better utilize data in your own business's day-to-day operations. The courses are all taught by Zenva Academy (4.4/5-star instructor rating), one of the premier online learning destinations. While the bundle covers a number of technologies, it gives special emphasis to Python. Python is the world's most popular programming language because it is a general-purpose language that focuses on readability and extensibility. Because it's so flexible, it's used in everything from bulk mathematical calculation to web and mobile backends to machine learning. It's an essential tool to learn if you want to work with massive amounts of data, and this bundle will introduce you to Python before giving you practical instruction in working with Python to read data from APIs, process images, work with Python Turtle, visualize data in many ways, build a game, and more.


MIT Researcher Explores The Downside Of Machine Learning In Healthcare - Liwaiwai

#artificialintelligence

While working toward her dissertation in Computer Science, Marzyeh Ghassemi PhD '17 wrote some papers on how machine learning techniques from AI could be applied to clinical data in order to predict patient outcomes. "It wasn't until the end of my PhD work that one of my committee members asked: 'Did you ever check to see how well your model worked across different groups of people?'" That question was eye-opening for Ghassemi, who had previously assessed the performance of models in aggregate, across all patients. Upon a closer look, she saw that models often worked differently, and specifically worse, for minority groups such as Black women, a revelation that took her by surprise. "I hadn't made the connection beforehand that health disparities would translate directly to model disparities," she says. "And given that I am a visible minority woman-identifying computer scientist at MIT, I am reasonably certain that many others weren't aware of this either."